This post computes entropy on two small datasets and then applies a simple rule-learning algorithm to them.
The data files used are:
sailing-custom-python.tab
zoo-python.tab
1. Import the packages

```python
import pandas as pd
import numpy as np
import math
```
2. Load the data with pandas

```python
sailData = pd.read_table('sailing-custom-python.tab')
zooData = pd.read_table('zoo-python.tab')
# Drop the 'name' column from the zoo data
zooData = zooData.drop(columns='name')
```
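The `.tab` files themselves are not reproduced in this post. As a rough illustration of what `pd.read_table` expects, the sketch below parses a tiny in-memory sample instead of a file. The three sample rows are made up for the example, and the layout (one header line, tab-separated fields) is an assumption about the files, not something checked against them.

```python
import io
import pandas as pd

# Hypothetical sample: a header line, then one tab-separated record per line
sample = (
    "name\thair\tfeathers\ttype\n"
    "bear\tYes\tNo\tmammal\n"
    "dove\tNo\tYes\tbird\n"
    "carp\tNo\tNo\tfish\n"
)

# read_table uses a tab as its default separator
zoo_sample = pd.read_table(io.StringIO(sample))

# Drop the identifier column, as the post does for the real zoo data
zoo_sample = zoo_sample.drop(columns="name")
print(zoo_sample.columns.tolist())
```

The `name` column is presumably dropped because it identifies a single animal and would not make a useful rule condition.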
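3. A function for computing entropy

Formula reference:

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the proportion of rows in $S$ that belong to class $i$.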
```python
def entropy(data, target):
    # Number of rows for each distinct value of the target column
    count = data[target].value_counts()
    dataSize = data[target].size
    entropyValue = 0
    for value in count:
        # Share of rows in this class
        proportion = value / dataSize
        entropyValue -= proportion * math.log(proportion, 2)
    return entropyValue
```
Check that the function runs:

```python
entropy(sailData, 'Sail')
```

```python
entropy(zooData, 'type')
```

Output:
0.9975025463691153
2.390559682294039
The function runs as expected.
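The first value can be checked by hand: the sailing table printed in the note at the end of this post has 9 `yes` rows and 8 `no` rows in the `Sail` column. The sketch below recomputes the entropy directly from those counts. `entropy_of_counts` is a helper introduced only for this check and is not part of the original code.

```python
import math

def entropy_of_counts(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 9 "yes" rows and 8 "no" rows, as in the Sail column of the sailing data
sail_entropy = entropy_of_counts([9, 8])
print(round(sail_entropy, 4))

# Two reference points: a single-class set and an even two-class split
pure_entropy = entropy_of_counts([5])
even_entropy = entropy_of_counts([4, 4])
```

A set with a single class has entropy 0 and an even two-class split has entropy 1, so a value of about 0.9975 says the `Sail` column is almost evenly split.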
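4. Find the most frequent class in the target column and return it

```python
def majority_class(data, targetClass):
    # Count each class, then return the label with the highest count
    counts = data[targetClass].value_counts()
    max_name = counts.idxmax()
    return max_name
```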
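A quick way to see what this returns is to run the same two calls on a toy column. The five-row frame below is invented for the example.

```python
import pandas as pd

# Hypothetical toy data: three "yes" rows and two "no" rows
toy = pd.DataFrame({"Sail": ["yes", "no", "yes", "yes", "no"]})

counts = toy["Sail"].value_counts()  # class label -> number of rows
most_common = counts.idxmax()        # label with the largest count
print(most_common)
```

Note that the result is a class label (a value of the column), not a column name.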
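5. The simple rule learner

```python
def simpler_rule_learner(data, target):
    while data.shape[0] > 0:
        if entropy(data, target) == 0:
            # All remaining rows share one class: print the default rule and stop
            print("otherwise =>", majority_class(data, target))
            data = data.iloc[0:0]
        else:
            best_entropy = entropy(data, target)
            best_attribute = ''
            best_value = ''
            best_data = data
            for attribute in data:
                if attribute == target:
                    continue  # never build a rule on the target column itself
                for value in data[attribute].unique():
                    # Rows covered by the candidate rule "attribute = value"
                    data2 = data.loc[data[attribute] == value]
                    candidate_entropy = entropy(data2, target)
                    if candidate_entropy < best_entropy:
                        best_entropy = candidate_entropy
                        best_attribute = attribute
                        best_value = value
                        best_data = data2
            if best_attribute == '':
                # No candidate lowers the entropy: fall back to the majority class
                print("otherwise =>", majority_class(data, target))
                break
            print(best_attribute, "=", best_value, "=>", majority_class(best_data, target))
            # Remove the covered rows and learn the next rule on the rest
            data = data.loc[data[best_attribute] != best_value]
```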
Test the function:

```python
simpler_rule_learner(sailData, 'Sail')
```

```
Company = big => yes
Outlook = rainy => no
Company = med => yes
Sailboat = small => yes
otherwise => no
```

```python
simpler_rule_learner(zooData, 'type')
```

```
feathers = Yes => bird
milk = Yes => mammal
hair = Yes => insect
airborne = Yes => insect
fins = Yes => fish
legs = 8.0 => invertebrate
eggs = No => reptile
breathes = No => invertebrate
aquatic = Yes => amphibian
predator = Yes => reptile
backbone = Yes => reptile
legs = 6.0 => insect
otherwise => invertebrate
```
With that, the simple rule learner produces the expected output for both datasets.
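The learner only prints its rules. As a rough sketch of how such an ordered rule list could be reused for prediction, the variant below returns the rules as tuples and applies the first one that matches a row. `learn_rules` and `predict` are hypothetical names that do not appear in the original code, and the 17 rows are copied from the sailing table printed in the note at the end of this post.

```python
import math
import pandas as pd

def entropy(data, target):
    counts = data[target].value_counts()
    total = counts.sum()
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def learn_rules(data, target):
    """Greedy rule learner that returns (attribute, value, label) tuples."""
    rules = []
    attributes = [c for c in data.columns if c != target]
    while len(data) > 0:
        best = None
        best_entropy = entropy(data, target)
        for attribute in attributes:
            for value in data[attribute].unique():
                subset = data.loc[data[attribute] == value]
                e = entropy(subset, target)
                if e < best_entropy:
                    best_entropy, best = e, (attribute, value, subset)
        if best is None:
            # Nothing lowers the entropy: emit the default ("otherwise") rule
            rules.append((None, None, data[target].value_counts().idxmax()))
            break
        attribute, value, subset = best
        rules.append((attribute, value, subset[target].value_counts().idxmax()))
        data = data.loc[data[attribute] != value]
    return rules

def predict(rules, row):
    """Return the label of the first rule that matches the row."""
    for attribute, value, label in rules:
        if attribute is None or row[attribute] == value:
            return label
    return None

rows = """rainy big big yes
rainy big small yes
rainy med big no
rainy med small no
sunny big big yes
sunny big small yes
sunny med big yes
sunny med big yes
sunny med small yes
sunny no small yes
sunny no big no
rainy med big no
rainy no big no
rainy no big no
rainy no small no
rainy no small no
sunny big big yes"""
sail = pd.DataFrame([line.split() for line in rows.splitlines()],
                    columns=["Outlook", "Company", "Sailboat", "Sail"])

rules = learn_rules(sail, "Sail")
prediction = predict(rules, {"Outlook": "sunny", "Company": "no", "Sailboat": "big"})
```

Because the rules are tried in order, each one only has to be correct for the rows that the earlier rules did not cover, which is why the list must be applied first-match rather than as independent conditions.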
Note: to select the rows where a given column has a given value, use `data.loc` as follows:

```python
print(sailData)
print()
attribute = 'Outlook'
value = 'rainy'
print(sailData.loc[sailData[attribute] == value])
```
```
    Outlook  Company  Sailboat  Sail
0     rainy      big       big   yes
1     rainy      big     small   yes
2     rainy      med       big    no
3     rainy      med     small    no
4     sunny      big       big   yes
5     sunny      big     small   yes
6     sunny      med       big   yes
7     sunny      med       big   yes
8     sunny      med     small   yes
9     sunny       no     small   yes
10    sunny       no       big    no
11    rainy      med       big    no
12    rainy       no       big    no
13    rainy       no       big    no
14    rainy       no     small    no
15    rainy       no     small    no
16    sunny      big       big   yes

    Outlook  Company  Sailboat  Sail
0     rainy      big       big   yes
1     rainy      big     small   yes
2     rainy      med       big    no
3     rainy      med     small    no
11    rainy      med       big    no
12    rainy       no       big    no
13    rainy       no       big    no
14    rainy       no     small    no
15    rainy       no     small    no
```
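The same boolean-mask idea extends to several conditions and to the complement of a selection, which is what the rule learner relies on when it removes covered rows. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

# Hypothetical toy data, invented for this example
toy = pd.DataFrame({
    "Outlook": ["rainy", "rainy", "sunny", "sunny"],
    "Company": ["big", "med", "big", "no"],
    "Sail": ["yes", "no", "yes", "no"],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (toy["Outlook"] == "rainy") & (toy["Company"] == "big")
matched = toy.loc[mask]

# ~mask selects every row the mask did not match
remaining = toy.loc[~mask]
print(matched)
print(remaining)
```

`matched` and `remaining` together contain every row of `toy` exactly once, and the original row labels are kept in both.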
That's all.